Stochastic Contextual Bandits with Known Reward Functions

Authors

  • Pranav Sakulkar
  • Bhaskar Krishnamachari
Abstract

Many sequential decision-making problems in communication networks, such as power allocation in energy harvesting communications, mobile computational offloading, and dynamic channel selection, can be modeled as contextual bandit problems, which are natural extensions of the well-known multi-armed bandit problem. In these problems, each resource allocation or selection decision can make use of available side information such as the harvested power, the specifications of the jobs to be offloaded, or the number of packets to be transmitted. In contextual bandit problems, at each step of a sequence of trials, an agent observes the side information or context, pulls one arm, and receives the reward for that arm. We consider the stochastic formulation where the context-reward tuples are independently drawn from an unknown distribution in each trial. The goal is to design strategies that minimize the expected cumulative regret, or reward loss, compared to a distribution-aware genie. We analyze a setting where the reward is a known non-linear function of the context and the chosen arm's current state, which is the case in many networking applications. This knowledge of the reward function enables us to exploit the obtained reward information to learn about the rewards for other possible contexts. We first consider the case of discrete and finite context spaces and propose DCB(ε), an algorithm whose regret scales logarithmically in time and linearly in the number of arms that are not optimal for any context. This is in contrast with existing algorithms, where the regret scales linearly in the total number of arms. Also, the storage requirements of DCB(ε) do not grow with time. DCB(ε) extends the UCB1 policy for standard multi-armed bandits to contextual bandits, using sophisticated proof techniques for the regret analysis. We then study continuous context spaces with Lipschitz reward functions and propose CCB(ε, δ), an algorithm that uses DCB(ε) as a subroutine. CCB(ε, δ) reveals a novel regret-storage trade-off that is parametrized by δ. Tuning δ to the time horizon allows us to obtain sub-linear regret bounds while requiring only sub-linear storage. Joint learning for all the contexts results in regret bounds that are unachievable by any existing contextual bandit algorithm for continuous context spaces. Similar performance bounds are also shown to hold for the unknown-horizon case by employing a doubling trick.
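Only the abstract is available on this page, but the joint-learning mechanism it describes is concrete enough to sketch. The Python fragment below is a minimal illustration under stated assumptions, not the authors' pseudocode: the names dcb_sketch and draw_state, the toy environment, and the (2 + ε) exploration constant are all hypothetical. Only the core idea follows the abstract, namely that one observed arm state updates the reward estimates for every context through the known reward function.

```python
import math
import random

def dcb_sketch(contexts, n_arms, reward_fn, draw_state, horizon, eps=0.1):
    """UCB1-style index policy exploiting a known reward function.

    A sketch of the idea behind DCB(eps): pulling an arm reveals its
    current state, and because reward_fn(context, state) is known, a
    single observation updates that arm's reward estimate for *every*
    context at once. `draw_state` stands in for the environment.
    """
    sums = {x: [0.0] * n_arms for x in contexts}  # per-context reward sums
    pulls = [0] * n_arms  # pull counts are shared across contexts
    total = 0.0

    for t in range(1, horizon + 1):
        x = random.choice(contexts)  # i.i.d. context arrival, as in the paper's model
        if 0 in pulls:
            k = pulls.index(0)  # initialization: play every arm once
        else:
            # Index = estimated reward under the current context + exploration
            # bonus; the (2 + eps) constant is an assumption echoing DCB(eps).
            k = max(range(n_arms),
                    key=lambda a: sums[x][a] / pulls[a]
                    + math.sqrt((2 + eps) * math.log(t) / pulls[a]))
        state = draw_state(k)  # pull arm k and observe its current state
        pulls[k] += 1
        total += reward_fn(x, state)
        for c in contexts:  # known reward function: update all contexts at once
            sums[c][k] += reward_fn(c, state)
    return total
```

Under the same reading of the abstract, CCB(ε, δ) would partition a continuous context space into cells of width δ, run a DCB(ε)-style routine over the cell representatives, and rely on the Lipschitz property of the reward function to bound the quantization error; choosing δ as a function of the horizon then trades the number of stored cells against regret.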


Similar Articles

Semiparametric Contextual Bandits

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve Õ(d√T) regret over T rounds, when the linear function is d-dimensional, which matches the best known bound...
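As a concrete reading of the reward model described above, the following sketch is illustrative only: the parameter values, the sine confounder, and the noise scale are assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 5, 10
theta = rng.normal(size=d)                  # unknown linear parameter
features = rng.normal(size=(n_actions, d))  # known action features

def semiparametric_reward(action, t):
    # Linear term plus a non-linear, action-independent confounder nu(t);
    # the sine confounder and noise scale are illustrative choices only.
    nu_t = np.sin(0.1 * t)                  # identical for all actions in round t
    return float(features[action] @ theta) + nu_t + rng.normal(scale=0.1)
```

Because ν(t) shifts every action's reward equally within a round, it cancels when rewards are compared across actions, which is what makes recovering θ, and hence the quoted Õ(d√T) regret, possible despite the confounder.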


Contextual Bandits with Stochastic Experts

We consider the problem of contextual bandits with stochastic experts, which is a variation of the traditional stochastic-contextual-bandits-with-experts problem. In our setting, we assume access to a class of stochastic experts, where each expert is a conditional distribution over the arms given a context. We propose upper-confidence bound (UCB) algorithms for this problem, which employ...


Linear Contextual Bandits with Knapsacks

We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption doesn’t exceed the budget for each ...
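The outcome model this abstract describes can be made concrete with a small sketch; everything here (dimensions, distributions, the pull helper) is an assumption for illustration, since the truncated abstract specifies only the model, not the algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_resources, budget = 4, 2, 100.0
theta = rng.normal(size=d)              # unknown expected-reward weights
W = rng.uniform(size=(n_resources, d))  # unknown expected-consumption weights

def pull(arm_context):
    # Per the abstract: pulling an arm yields a scalar reward and a vector of
    # resource consumptions, both with expectations linear in the arm's context.
    reward = float(arm_context @ theta) + rng.normal(scale=0.1)
    consumption = W @ arm_context + rng.normal(scale=0.05, size=n_resources)
    return reward, consumption
```

Play stops once the cumulative consumption of any resource would exceed the budget, so an algorithm for this setting must trade immediate reward against remaining capacity.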


A Survey on Contextual Multi-armed Bandits

4 Stochastic Contextual Bandits
  4.1 Stochastic Contextual Bandits with Linear Realizability Assumption
    4.1.1 LinUCB/SupLinUCB
    4.1.2 LinREL/SupLinREL
    4.1.3 CofineUCB
    4.1.4 Thompson Sampling with Linear Payoffs...


PAC-Bayesian Analysis of Contextual Bandits

We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as ...



Journal:
  • CoRR

Volume: abs/1605.00176  Issue:

Pages: -

Publication date: 2016